

Search for: All records

Creators/Authors contains: "Vo, Huy"


  1. Despite continued technological improvements, measurement errors always reduce or distort the information that any real experiment can provide to quantify cellular dynamics. This problem is particularly serious for cell signaling studies that quantify heterogeneity in single-cell gene regulation, where important RNA and protein copy numbers are themselves subject to the inherently random fluctuations of biochemical reactions. Until now, it has not been clear how measurement noise should be managed, alongside other experiment design variables (e.g., sampling size, measurement times, or perturbation levels), to ensure that collected data will provide useful insights into the signaling or gene expression mechanisms of interest. We propose a computational framework that explicitly accounts for measurement errors in the analysis of single-cell observations, and we derive Fisher Information Matrix (FIM)-based criteria to quantify the information value of distorted experiments. We apply this framework to multiple models in the context of simulated and experimental single-cell data for a reporter gene controlled by an HIV promoter. We show that the proposed approach quantitatively predicts how different types of measurement distortion affect the accuracy and precision of model identification, and we demonstrate that the effects of these distortions can be mitigated by accounting for them explicitly during model inference. We conclude that this reformulation of the FIM could be used effectively to design single-cell experiments that optimally harvest fluctuation information while mitigating the effects of image distortion.
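    The information calculation described in this abstract can be illustrated with a short sketch (a hypothetical illustration, not the authors' code): when the distribution of true molecule counts passes through a known probabilistic distortion channel, the FIM of the observed data follows from the distorted probability vector and its parameter sensitivity. The Poisson model, the binomial-thinning channel, and all names below are our assumptions.

    import numpy as np
    from scipy.stats import binom, poisson

    def fim_distorted(p_true, dp_dtheta, C):
        """FIM for a scalar parameter when observations are distorted.

        p_true    : probability vector over true molecule counts x
        dp_dtheta : sensitivity of p_true with respect to the parameter theta
        C         : distortion matrix, C[y, x] = P(observe y | true count x)
        """
        p_obs = C @ p_true        # distribution of the distorted observations
        s_obs = C @ dp_dtheta     # its sensitivity with respect to theta
        mask = p_obs > 1e-12
        return np.sum(s_obs[mask] ** 2 / p_obs[mask])

    # Toy model (assumption): true counts ~ Poisson(theta); each molecule is
    # detected independently with probability q (binomial thinning).
    theta, q, xmax = 20.0, 0.6, 200
    x = np.arange(xmax + 1)
    p_true = poisson.pmf(x, theta)
    dp = p_true * (x / theta - 1.0)               # d/dtheta of the Poisson pmf
    C = binom.pmf(x[:, None], x[None, :], q)      # C[y, x] = Binom(x, q).pmf(y)

    print(fim_distorted(p_true, dp, np.eye(xmax + 1)))   # ~1/theta = 0.050
    print(fim_distorted(p_true, dp, C))                  # ~q/theta = 0.030

    For this toy channel, thinning reduces the information about theta from 1/theta to q/theta, the kind of quantitative information loss that FIM-based criteria are designed to expose.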
  2. We develop GeoMatch as a novel, scalable, and efficient big-data pipeline for large-scale map matching on Apache Spark. GeoMatch improves on existing spatial big-data solutions with a novel spatial partitioning scheme inspired by Hilbert space-filling curves. Thanks to this partitioning scheme, GeoMatch can effectively balance operations across different processing units and achieve significant performance gains. GeoMatch also incorporates a dynamically adjustable error-correction technique that provides robustness against positioning errors. We demonstrate the effectiveness of GeoMatch through rigorous and extensive empirical benchmarks on large-scale urban spatial datasets ranging from 166,253 to 3.78 billion location measurements. We separately assess the execution performance and the accuracy of map matching, and we develop a benchmark framework for evaluating large-scale map matching. Our evaluation shows up to 27.25-fold performance improvements over previous work while achieving better processing accuracy than current solutions. We also showcase the practical potential of GeoMatch with two urban management applications. GeoMatch and our benchmark framework are open source.
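    As a rough, self-contained illustration of the partitioning idea (our sketch, not GeoMatch's implementation; the grid resolution, bounding box, and function names are assumptions), points can be keyed by the Hilbert index of their grid cell so that nearby points tend to fall in the same partition:

    def hilbert_index(order, x, y):
        """Distance of grid cell (x, y) along a Hilbert curve of the given
        order (the grid has 2**order cells per side); standard xy2d algorithm."""
        d = 0
        s = 2 ** (order - 1)
        while s > 0:
            rx = 1 if (x & s) > 0 else 0
            ry = 1 if (y & s) > 0 else 0
            d += s * s * ((3 * rx) ^ ry)
            if ry == 0:                       # rotate the quadrant
                if rx == 1:
                    x, y = s - 1 - x, s - 1 - y
                x, y = y, x
            s //= 2
        return d

    def partition_key(lon, lat, order=10, n_partitions=64,
                      bbox=(-180.0, -90.0, 180.0, 90.0)):
        """Discretize a point onto the grid and spread contiguous Hilbert
        index ranges evenly across partitions."""
        west, south, east, north = bbox
        n = 2 ** order
        gx = min(int((lon - west) / (east - west) * n), n - 1)
        gy = min(int((lat - south) / (north - south) * n), n - 1)
        return hilbert_index(order, gx, gy) * n_partitions // (n * n)

    Because the Hilbert curve preserves locality, contiguous index ranges map to compact regions, so each partition receives a spatially coherent chunk of work (for example, via Spark's partitionBy with a key function along these lines).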
  3. Utilizing large-scale urban datasets to predict taxi and Uber passenger demand in cities is valuable for designing better taxi dispatch systems and improving taxi services. In this paper, we predict taxi and Uber demand using two real-world datasets. Our approach consists of two key steps. First, we use temporal-correlated entropy to measure the regularity of demand and obtain its maximum predictability. Second, we implement and assess five well-known representative predictors (Markov, LZW, ARIMA, MLP, and LSTM) against that maximum predictability. The results show that, on average, the maximum predictability can be as high as 83%, indicating a high temporal regularity of taxi demand in cities. In areas with low maximum predictability (Πmax < 0.83), the deep learning predictor LSTM achieves high prediction accuracy by capturing hidden long-term temporal dependencies. In areas with high maximum predictability (Πmax ⩾ 0.83), the Markov predictor infers taxi demand with 86% accuracy, 14% better than LSTM, while requiring only 0.02% of the computation time. These findings suggest that the maximum predictability can help determine which predictor to use with respect to accuracy and computational cost.
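    A minimal sketch of the two steps above (the estimator variant, names, and toy series are our assumptions, not the paper's code): estimate the entropy rate of a location's demand sequence with a Lempel-Ziv estimator, then solve Fano's inequality for the maximum predictability Πmax.

    import math

    def lz_entropy(seq):
        """Lempel-Ziv entropy-rate estimate (bits per symbol) of a sequence."""
        n = len(seq)
        lam_sum = 0
        for i in range(n):
            k = 1    # length of the shortest substring at i not seen before i
            while i + k <= n and any(seq[j:j + k] == seq[i:i + k]
                                     for j in range(i)):
                k += 1
            lam_sum += k
        return n * math.log2(n) / lam_sum

    def max_predictability(S, N, tol=1e-6):
        """Solve Fano's inequality S = H(p) + (1 - p) * log2(N - 1) for the
        maximum predictability Pi_max, with N distinct demand levels."""
        def fano(p):
            h = 0.0 if p in (0.0, 1.0) else (-p * math.log2(p)
                                             - (1 - p) * math.log2(1 - p))
            return h + (1 - p) * math.log2(N - 1) - S

        lo, hi = 1.0 / N, 1.0 - 1e-12
        while hi - lo > tol:          # fano() is decreasing in p on [1/N, 1)
            mid = (lo + hi) / 2
            lo, hi = (mid, hi) if fano(mid) > 0 else (lo, mid)
        return (lo + hi) / 2

    demand = [3, 1, 0, 2, 3, 1, 0, 2, 3, 1, 0, 2]      # toy hourly demand levels
    print(max_predictability(lz_entropy(demand), N=4))  # high Pi_max for this
                                                        # near-periodic toy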
  4. Most humans today have mobile phones; according to the GSMA, almost 10 billion mobile connections are active in the world every day. These devices automatically capture behavioral data from human society and store it in databases around the world. However, capturing these data poses several challenges, especially when they come from old sources: obsolete technologies such as 2G and 3G still account for two-thirds of all devices. To the best of our knowledge, all previous work either eliminates only the obvious problems in the data or uses well-curated data. Eliminating traces from a time series can introduce deviations and biases into further analyses, especially when studying small areas or groups of people within a city. In this work, we present two algorithms that solve the problem of the Neighboring Network Hit (NNH) and calculate trip and traveled-distance distributions with greater precision for small areas or groups of people. The NNH problem arises when a mobile device connects to cellular sites other than those defined in the network design, which complicates the analysis of spatio-temporal mobility. We use cellular device data from three cities in Chile, obtained from a mobile phone operator and duly anonymized. We compare our results against the Government's Origin and Destination Surveys, and we use a novel method to generate synthetic data to which errors are added in a controlled manner to evaluate the performance of our solution. We conclude that our algorithms improve on naive methods, increasing the accuracy of trip counts and, above all, of the distance distributions.
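    One simple heuristic in the spirit of such corrections (our sketch only; the paper's two algorithms are more involved, and the time threshold is an assumption) is to collapse rapid A→B→A "ping-pong" handovers between neighboring cells, which otherwise look like spurious trips:

    def collapse_ping_pong(events, max_gap_s=120):
        """Remove A->B->A oscillations between neighboring cell sites.

        events: time-sorted (timestamp_s, cell_id) records for one device.
        Drops the middle record of any A-B-A triple completed within
        max_gap_s, treating it as a Neighboring Network Hit, not a trip.
        """
        out = list(events)
        changed = True
        while changed:                # collapsing can expose new triples
            changed = False
            i = 0
            while i + 2 < len(out):
                (t0, a), (_, b), (t2, c) = out[i], out[i + 1], out[i + 2]
                if a == c and a != b and t2 - t0 <= max_gap_s:
                    del out[i + 1]    # discard the spurious hit on cell B
                    changed = True
                else:
                    i += 1
        return out

    trace = [(0, "A"), (40, "B"), (70, "A"), (4000, "C")]
    print(collapse_ping_pong(trace))  # [(0, 'A'), (70, 'A'), (4000, 'C')]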
  5. As the problem of drug abuse intensifies in the U.S., many studies that primarily utilize social media data, such as postings on Twitter, to study drug abuse-related activities use machine learning as a powerful tool for text classification and filtering. However, given the wide range of topics Twitter users post about, tweets related to drug abuse are rare in most datasets. This class imbalance remains a major obstacle to building effective tweet classifiers and is especially pronounced in studies that include abuse-related slang terms. In this study, we address the problem by designing an ensemble deep learning model that leverages both word-level and character-level features to classify abuse-related tweets. We report experiments on a Twitter dataset in which the proportions of the two classes (abuse vs. non-abuse) can be configured to simulate imbalance of varying severity. The results show that our ensemble deep learning models outperform ensembles of traditional machine learning models, especially on heavily imbalanced datasets.
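    A minimal PyTorch sketch of one such two-branch member model (the dimensions, pooling choices, and names are our assumptions, not the paper's architecture): a word-level BiLSTM branch and a character-level CNN branch feed one classifier; the ensemble averages member logits, and a positive-class weight counters the imbalance.

    import torch
    import torch.nn as nn

    class WordCharNet(nn.Module):
        """One ensemble member: word-level BiLSTM + character-level CNN."""
        def __init__(self, word_vocab, char_vocab, w_dim=100, c_dim=25, hidden=64):
            super().__init__()
            self.w_emb = nn.Embedding(word_vocab, w_dim, padding_idx=0)
            self.c_emb = nn.Embedding(char_vocab, c_dim, padding_idx=0)
            self.lstm = nn.LSTM(w_dim, hidden, batch_first=True,
                                bidirectional=True)
            self.conv = nn.Conv1d(c_dim, hidden, kernel_size=5, padding=2)
            self.head = nn.Linear(3 * hidden, 1)  # 2*hidden (words) + hidden (chars)

        def forward(self, word_ids, char_ids):
            w, _ = self.lstm(self.w_emb(word_ids))    # (B, Tw, 2*hidden)
            w = w.max(dim=1).values                   # max-pool over word positions
            c = self.c_emb(char_ids).transpose(1, 2)  # (B, c_dim, Tc) for Conv1d
            c = torch.relu(self.conv(c)).max(dim=2).values
            return self.head(torch.cat([w, c], dim=1)).squeeze(-1)  # logit

    members = [WordCharNet(word_vocab=20000, char_vocab=128) for _ in range(3)]
    # Weight the rare "abuse" class more heavily to counter the imbalance.
    loss_fn = nn.BCEWithLogitsLoss(pos_weight=torch.tensor(10.0))

    def ensemble_logit(word_ids, char_ids):
        """Average the member logits; each member is trained independently."""
        return torch.stack([m(word_ids, char_ids) for m in members]).mean(dim=0)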